[dataloader] Fix text filtering bug and speed up spectrum length calc #216

lsrami · 2024-05-22T06:10:10Z

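1. Fixed the text-length check in _filter in data_utils.py. The text was previously not split() before its length was measured, so spaces were counted toward the text length.
2. In _filter in data_utils.py, the audio sample rate is now read with the soundfile library instead of torchaudio, which is about 5 times faster.
3. Added a progress bar to _filter in data_utils.py.
4. Added tools/compute_spec_length.py, which precomputes spectrogram feature lengths with multiple threads and saves data loading time. To use it, change the original train.txt format to filename|speaker|text|spec_length, i.e. the fourth column holds the feature length.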
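A minimal sketch of the text-length fix and the four-column line format, for illustration only. It is not the actual code in data_utils.py: the function names and the min_len/max_len limits are hypothetical, and it assumes the text field contains no "|" character.

```python
def text_length(text: str) -> int:
    """Count whitespace-separated tokens instead of raw characters."""
    return len(text.split())


def parse_line(line: str):
    """Parse 'filename|speaker|text' with an optional 4th spec_length column."""
    fields = line.rstrip("\n").split("|")
    filename, speaker, text = fields[:3]
    # The fourth column is optional: None means the length was not precomputed.
    spec_length = int(fields[3]) if len(fields) > 3 else None
    return filename, speaker, text, spec_length


def keep_text(text: str, min_len: int = 1, max_len: int = 190) -> bool:
    """Hypothetical filter: keep a sample if its token count is within limits."""
    return min_len <= text_length(text) <= max_len


if __name__ == "__main__":
    text = "n i3 h ao3"
    # len(text) also counts the three spaces; text_length counts tokens only.
    print(len(text), text_length(text))
    print(parse_line("a.wav|spk1|n i3 h ao3|123"))
```

With the old behavior, a space-separated phoneme sequence looks much longer than it is, so samples near the maximum length could be filtered out incorrectly.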


lsrami · 2024-05-22T06:18:22Z

Speed comparison for reading the audio sample rate of 1000 audio files:

python compare_audio.py 1000.txt
soundfile: 100%|███████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:01<00:00, 537.85it/s]
soundfile:  1.861177682876587
torchaudio: 100%|███████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:10<00:00, 99.78it/s]
torchaudio:  10.02232575416565

As shown, soundfile is about 5 times faster than torchaudio.
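The compare_audio.py script itself is not included in this thread. The following is a rough sketch of how such a comparison could be timed, under the assumption that each library is only asked for the sample rate from the file header. The helper names are hypothetical, soundfile.info and torchaudio.info are imported lazily, and torchaudio.info is assumed to be available in the installed torchaudio version.

```python
import time


def time_reader(read_sample_rate, paths):
    """Return the seconds taken to call read_sample_rate on every path."""
    start = time.perf_counter()
    for path in paths:
        read_sample_rate(path)
    return time.perf_counter() - start


def soundfile_rate(path):
    import soundfile as sf  # lazy import: only needed when this reader is used
    return sf.info(path).samplerate


def torchaudio_rate(path):
    import torchaudio  # lazy import; assumes a version that provides info()
    return torchaudio.info(path).sample_rate


def compare(paths, readers):
    """Time each named reader over the same list of paths."""
    return {name: time_reader(fn, paths) for name, fn in readers.items()}


if __name__ == "__main__":
    import sys

    # Assumption: the list file holds one 'filename|speaker|text' entry per line.
    with open(sys.argv[1], encoding="utf-8") as f:
        wav_paths = [line.split("|")[0].strip() for line in f if line.strip()]
    readers = {"soundfile": soundfile_rate, "torchaudio": torchaudio_rate}
    for name, seconds in compare(wav_paths, readers).items():
        print(f"{name}: {seconds}")
```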

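Suppose the training set contains 1 million audio files. If the sample rate is read during training to compute the feature lengths, the dataloader stage takes about 9 hours. With the feature lengths precomputed by tools/compute_spec_length.py, the dataloader stage takes only 1 minute.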
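A sketch of the precomputation idea, not the contents of tools/compute_spec_length.py. It assumes the feature length is the number of frames after resampling to a target rate, divided by the hop length. The function names, the default target_sr and hop_length values, and the injectable read_info parameter are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def sf_read_info(path):
    """Return (num_frames, sample_rate) read from the file header."""
    import soundfile as sf  # lazy import: only needed for real audio files
    info = sf.info(path)
    return info.frames, info.samplerate


def spec_length(num_frames, sample_rate, target_sr=22050, hop_length=256):
    # Assumption: length = frames after resampling to target_sr // hop_length.
    return int(num_frames * target_sr / sample_rate) // hop_length


def add_spec_length(lines, read_info=sf_read_info, workers=8, **kwargs):
    """Append spec_length as a fourth column to 'filename|speaker|text' lines."""

    def process(line):
        filename, speaker, text = line.strip().split("|")[:3]
        frames, sample_rate = read_info(filename)
        length = spec_length(frames, sample_rate, **kwargs)
        return f"{filename}|{speaker}|{text}|{length}"

    # Header reads are I/O bound, so threads help; map() preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, lines))
```

The resulting lines can be written back as the new train.txt, so the dataloader reads the length from the fourth column instead of opening every audio file.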

[dataloader] Fix text filtering bug and speed up spectrum length calculation

9197232

[fix] Fix code style check

2cdde9a

pengzhendong merged commit 97b83e8 into wenet-e2e:main May 22, 2024
3 checks passed
